Differential Temporal Difference Learning
Abstract
Value functions derived from Markov decision processes arise as a central component of algorithms as well as performance metrics in many statistics and engineering applications of machine learning. Computing the solution to the associated Bellman equations is challenging in most practical cases of interest. A popular class of approximation techniques, known as temporal difference (TD) learning algorithms, is an important subclass of general reinforcement learning methods. The algorithms introduced in this article are intended to resolve two well-known issues with TD-learning algorithms: their slow convergence due to very high asymptotic (central limit theorem) variance, and the fact that, for the problem of computing the relative value function, consistent algorithms exist only in special cases. First, we show that the gradients of these value functions admit a representation that lends itself to algorithm design. Based on this result, a new class of differential TD-learning algorithms is introduced. For Markovian models on Euclidean space with smooth dynamics, the algorithms are shown to be consistent under general conditions. Numerical results show dramatic variance reduction in comparison with standard TD-learning algorithms.
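To make the setting concrete, the following is a minimal sketch of the *standard* TD(0) baseline that the abstract compares against; it is not the differential TD-learning algorithm introduced in the article. It assumes a hypothetical 3-state Markov chain with a discounted cost (discount factor `gamma`) instead of the relative value function setting, and the chain, costs, step-size schedule alpha_t = (t + 1)^(-0.6), and the helper names `exact_value` and `td0_linear` are all illustrative choices. For comparison, the exact solution of the Bellman equation V = c + gamma P V is obtained with a direct linear solve.

```python
import numpy as np


def exact_value(P, c, gamma):
    """Solve the discounted Bellman equation V = c + gamma * P @ V directly."""
    n = len(c)
    return np.linalg.solve(np.eye(n) - gamma * P, c)


def td0_linear(P, c, gamma, features, num_steps=100_000, seed=0):
    """Standard TD(0) with a linear approximation V(x) ~ features[x] @ theta.

    Simulates one long trajectory of the chain with transition matrix P and
    applies the TD(0) update with the (hypothetical) step size
    alpha_t = (t + 1) ** -0.6.
    """
    rng = np.random.default_rng(seed)
    n, d = features.shape
    cum = np.cumsum(P, axis=1)
    cum[:, -1] = 1.0  # guard against round-off in the row sums
    u = rng.random(num_steps)  # uniforms in [0, 1) used to sample transitions
    theta = np.zeros(d)
    x = 0
    for t in range(num_steps):
        # Inverse-CDF sampling of the next state from row x of P.
        x_next = int(np.searchsorted(cum[x], u[t], side="right"))
        phi, phi_next = features[x], features[x_next]
        # Temporal difference: one-step Bellman target minus current estimate.
        td_error = c[x] + gamma * (phi_next @ theta) - phi @ theta
        alpha = (t + 1) ** -0.6
        theta += alpha * td_error * phi
        x = x_next
    return theta


# Hypothetical 3-state chain and per-state costs, chosen only for illustration.
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.4, 0.1, 0.5]])
c = np.array([1.0, 2.0, 0.0])
gamma = 0.8

V = exact_value(P, c, gamma)
theta = td0_linear(P, c, gamma, features=np.eye(3))  # tabular (identity) features
print("exact:", np.round(V, 3), "TD(0):", np.round(theta, 3))
```

With tabular (identity) features the TD(0) estimate should approach the exact solution as the number of steps grows; with fewer features than states, linear TD(0) instead converges to a projected fixed point. Rerunning with different seeds shows the run-to-run spread of the estimate, which is the kind of variance the article aims to reduce.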
Similar resources
Dual Temporal Difference Learning
Recently, researchers have investigated novel dual representations as a basis for dynamic programming and reinforcement learning algorithms. Although the convergence properties of classical dynamic programming algorithms have been established for dual representations, temporal difference learning algorithms have not yet been analyzed. In this paper, we study the convergence properties of tempor...
Preconditioned Temporal Difference Learning
LSTD is numerically unstable for some ergodic Markov chains in which some states are visited preferentially over the remaining ones, because the matrix that LSTD accumulates has a large condition number. In this paper, we propose a variant of temporal difference learning with high data efficiency. A class of preconditioned temporal difference learning algorithms is also proposed to speed up the new met...
Emphatic Temporal-Difference Learning
Emphatic algorithms are temporal-difference learning algorithms that change their effective state distribution by selectively emphasizing and de-emphasizing their updates on different time steps. Recent works by Sutton, Mahmood and White (2015), and Yu (2015) show that by varying the emphasis in a particular way, these algorithms become stable and convergent under off-policy training with linea...
Natural Temporal Difference Learning
In this paper we investigate the application of natural gradient descent to Bellman error based reinforcement learning algorithms. This combination is interesting because natural gradient descent is invariant to the parameterization of the value function. This invariance property means that natural gradient descent adapts its update directions to correct for poorly conditioned representations. ...
Quasi Newton Temporal Difference Learning
Fast-converging and computationally inexpensive policy evaluation is an essential part of reinforcement learning algorithms based on policy iteration. Algorithms such as LSTD, LSPE, FPKF and NTD have faster convergence rates, but they are computationally slow. On the other hand, there are algorithms that are computationally fast but with slower convergence rate, among them are TD, RG, GTD2 and ...
Journal
Journal title: IEEE Transactions on Automatic Control
Year: 2021
ISSN: 0018-9286, 1558-2523, 2334-3303
DOI: https://doi.org/10.1109/tac.2020.3033417